
    Accumulated Gradient Normalization

    This work addresses the instability of asynchronous data-parallel optimization by introducing a novel distributed optimizer that efficiently optimizes a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. As a result, the magnitude of a worker delta is smaller than that of an accumulated gradient, and it provides a better direction towards a minimum than individual first-order gradients. This in turn forces possible implicit momentum fluctuations to be more aligned, under the assumption that all workers contribute towards a single minimum. Since staleness in asynchrony induces (implicit) momentum, our approach mitigates the parameter-staleness problem more effectively and achieves a better convergence rate than other optimizers such as asynchronous EASGD and DynSGD, as we show empirically.
    Comment: 16 pages, 12 figures, ACML201
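    The accumulate-and-normalize step can be sketched as follows; the quadratic toy loss, the number of local steps, and the learning rate are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def accumulated_gradient_normalization(theta, grad_fn, steps, lr):
    """One worker round: take `steps` local gradient-descent steps,
    then return the accumulated update normalized by the number of
    steps, i.e. the (smaller-magnitude) delta pushed to the server."""
    local = theta.copy()
    for _ in range(steps):
        local -= lr * grad_fn(local)
    # Normalizing shrinks the worker delta relative to the raw
    # accumulated gradient while keeping its (better) direction.
    return (local - theta) / steps

# Toy example: minimize f(x) = ||x||^2, whose gradient is 2x.
theta = np.array([4.0, -2.0])
for _ in range(50):
    delta = accumulated_gradient_normalization(theta, lambda x: 2 * x, steps=8, lr=0.05)
    theta += delta  # the parameter server applies the normalized delta
```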

    Adversarial Variational Optimization of Non-Differentiable Simulators

    Complex computer simulators are increasingly used across fields of science as generative models tying parameters of an underlying theory to experimental observations. Inference in this setup is often difficult, as simulators rarely admit a tractable density or likelihood function. We introduce Adversarial Variational Optimization (AVO), a likelihood-free inference algorithm for fitting a non-differentiable generative model, incorporating ideas from generative adversarial networks, variational optimization, and empirical Bayes. We adapt the training procedure of generative adversarial networks by replacing the differentiable generative network with a domain-specific simulator. We solve the resulting non-differentiable minimax problem by minimizing variational upper bounds of the two adversarial objectives. Effectively, the procedure results in learning a proposal distribution over simulator parameters, such that the Jensen-Shannon divergence between the marginal distribution of the synthetic data and the empirical distribution of observed data is minimized. We evaluate and compare the method with simulators producing both discrete and continuous data.
    Comment: v4: Final version published at AISTATS 2019; v5: Fixed typo in Eqn 1
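    The variational-optimization step at the heart of AVO can be sketched with a score-function (REINFORCE) gradient estimator, which only requires sampling the black-box objective, never differentiating it. The quadratic stand-in objective and the fixed-width Gaussian proposal are simplifying assumptions; AVO itself applies this idea to adversarial objectives.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta):
    # Stand-in for a non-differentiable simulator objective; treat it
    # as a black box with its (unknown) minimum at theta = 3.
    return (theta - 3.0) ** 2

def score_function_gradient(mu, sigma, n_samples=256):
    """Estimate d/dmu of U(mu) = E_{theta ~ N(mu, sigma^2)}[f(theta)]
    via the score-function estimator: E[f(theta) * d/dmu log q(theta)],
    where d/dmu log q(theta) = (theta - mu) / sigma^2."""
    theta = rng.normal(mu, sigma, n_samples)
    fx = f(theta)
    baseline = fx.mean()  # simple variance-reduction baseline
    return np.mean((fx - baseline) * (theta - mu) / sigma ** 2)

# Gradient descent on the proposal mean, using only samples of f.
mu, sigma = 0.0, 1.0
for _ in range(500):
    mu -= 0.05 * score_function_gradient(mu, sigma)
```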

    Gradient Energy Matching for Distributed Asynchronous Gradient Descent

    Distributed asynchronous SGD has become widely used for deep learning in large-scale systems, but remains notorious for its instability when increasing the number of workers. In this work, we study the dynamics of distributed asynchronous SGD through the lens of Lagrangian mechanics. Using this description, we introduce the concept of energy to describe the optimization process and derive a sufficient condition for its stability: the collective energy induced by the active workers must remain below the energy of a target synchronous process. Making use of this criterion, we derive a stable distributed asynchronous optimization procedure, GEM, that estimates and maintains the energy of the asynchronous system below or equal to the energy of sequential SGD with momentum. Experimental results highlight the stability and speedup of GEM compared to existing schemes, even when scaling to one hundred asynchronous workers. Results also indicate better generalization compared to the targeted SGD with momentum.
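    A heavily simplified version of the energy-matching idea is to rescale each proposed worker update so that its kinetic energy (squared norm) stays within a target budget; GEM's actual criterion tracks the energy of SGD with momentum, which is omitted in this sketch.

```python
import numpy as np

def gem_rescale(update, target_energy, eps=1e-12):
    """Rescale a worker's proposed update so that its kinetic energy
    (squared norm) does not exceed the energy budget of the target
    synchronous process; updates within budget pass through unchanged."""
    energy = np.dot(update, update)
    if energy <= target_energy:
        return update
    return update * np.sqrt(target_energy / (energy + eps))

update = np.array([3.0, 4.0])  # proposed update with energy 25
scaled = gem_rescale(update, target_energy=4.0)
# the scaled update's energy is clipped down to the target budget
```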

    Advances in Simulation-Based Inference: Towards the automation of the Scientific Method through Learning Algorithms

    This dissertation presents several novel techniques and guidelines to advance the field of simulation-based inference. Simulation-based inference, or likelihood-free inference, refers to statistical inference whenever simulating synthetic realizations x through detailed descriptions of their generating processes is possible, but evaluating the likelihood p(x | y) of parameters y tied to realizations x is intractable. In other words, while it is relatively simple to execute a computer simulation and collect samples from its generative process for various inputs y, it is rather difficult to invert the process and ask: "what set of parameters y could have been responsible for producing x, and what is their probability of doing so?" The likelihood p(x | y) plays a central role in answering this question. However, for most scientific simulators, direct evaluation of the (true and unknown) likelihood involves solving an inverse problem that rests on integrating over all possible forward realizations implicitly defined by the computer code of the simulator. This is the core reason why it is typically impossible to evaluate the likelihood model of a computer simulator: it requires integrating across all possible code paths for all inputs y that could have led to the realization x. Classical likelihood-based statistical inference is therefore impractical. Nevertheless, approximate inference remains possible by relying on surrogates that produce estimates of key quantities necessary for statistical inference.

    This thesis introduces various techniques and guidelines to effectively construct such surrogates and demonstrates how these approximations should be applied reliably. We explicitly make the point that the dogma of data efficiency should not be central to the field. Rather, reliable approximations should be, if we are ever to deduce scientific results with the techniques we have developed over the years. This point is strengthened by demonstrating that all techniques can produce approximations that are not reliable from a scientific point of view, that is, when one is interested in constraining parameters or models. We argue for novel protocols that provide theoretically backed reliability properties. To that end, this thesis introduces a novel algorithm that provides such guarantees in terms of the binary classifier; in fact, the theoretical result is applicable to any binary classification problem.

    Finally, these contributions are framed within the context of the automation of science. This thesis concerned itself with automating the last step of the scientific method, described as a recurrence over the sequence hypothesis, experiment, and conclusion. For the most part, however, the steps of hypothesis formation and experiment design remain solely for scientists to decide; only occasionally are they explored, designed, and automated through computer-assisted means. For these two steps, we provide research avenues and proofs of concept that could unlock their automation.
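    The regime described above, where sampling x given y is easy but evaluating p(x | y) is not, can be illustrated with classical rejection ABC, a baseline likelihood-free method rather than one of the thesis's contributions; the Gaussian simulator, the mean summary statistic, and the tolerance are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulator(y, n=100):
    # Black-box generative process: easy to sample from, but we
    # pretend its density p(x | y) cannot be evaluated.
    return rng.normal(y, 1.0, n)

def rejection_abc(x_observed, prior_draws, epsilon):
    """Keep the prior samples y whose simulated summary statistic lies
    within epsilon of the observed one; the accepted samples then
    approximate p(y | x_observed) without any likelihood evaluation."""
    s_obs = x_observed.mean()
    accepted = [y for y in prior_draws
                if abs(simulator(y).mean() - s_obs) < epsilon]
    return np.array(accepted)

x_obs = simulator(2.0)  # observations generated with ground truth y = 2
posterior = rejection_abc(x_obs, rng.uniform(-5, 5, 5000), epsilon=0.2)
```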

    Averting A Crisis In Simulation-Based Inference

    We present extensive empirical evidence showing that current Bayesian simulation-based inference algorithms are inadequate for the falsificationist methodology of scientific inquiry. Our results, collected through months of experimental computations, show that all benchmarked algorithms ((S)NPE, (S)NRE, SNL, and variants of ABC) may produce overconfident posterior approximations, which makes them demonstrably unreliable and dangerous if one's scientific goal is to constrain parameters of interest. We believe that failing to address this issue will lead to a well-founded trust crisis in simulation-based inference. For this reason, we argue that research efforts should now consider theoretical and methodological developments of conservative approximate inference algorithms, and we present research directions towards this objective. In this regard, we show empirical evidence that ensembles are consistently more reliable.
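    The ensembling remedy can be sketched generically: average individually normalized posterior approximations into an equal-weight mixture, which is more conservative than its most overconfident member. The two Gaussian-shaped approximators below are hypothetical stand-ins for trained estimators, not outputs of the benchmarked algorithms.

```python
import numpy as np

def ensemble_posterior(members, theta_grid):
    """Equal-weight mixture of individually normalized posterior
    approximations on a parameter grid; the mixture's variance is the
    average of the members' variances (for members sharing a mean),
    so it cannot collapse to the sharpest member's overconfidence."""
    dx = theta_grid[1] - theta_grid[0]
    densities = [m(theta_grid) for m in members]
    densities = [d / (d.sum() * dx) for d in densities]  # normalize each member
    return np.mean(densities, axis=0)

grid = np.linspace(-5.0, 5.0, 1001)
narrow = lambda t: np.exp(-0.5 * (t / 0.3) ** 2)  # overconfident approximator
wide = lambda t: np.exp(-0.5 * (t / 1.5) ** 2)    # conservative approximator
mixture = ensemble_posterior([narrow, wide], grid)
```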

    On Scalable Deep Learning and Parallelizing Gradient Descent

    Speeding up gradient-based methods has been a subject of interest over the past years, with many practical applications, especially in Deep Learning. Although many optimizations have been made at the hardware level, the convergence rate of very large models remains problematic. Therefore, data-parallel methods, in addition to mini-batch parallelism, have been suggested to further decrease the training time of parameterized models optimized with gradient-based methods. Nevertheless, asynchronous optimization was long considered too unstable for practical purposes due to a lacking understanding of the underlying mechanisms. Recently, a theoretical contribution defined asynchronous optimization in terms of (implicit) momentum, due to the presence of a queuing model of gradients based on past parameterizations. This thesis mainly builds upon that work to construct a better understanding of why asynchronous optimization shows proportionally more divergent behavior as the number of parallel workers increases, and how this affects existing distributed optimization algorithms. Furthermore, using our redefinition of parameter staleness, we construct two novel techniques for asynchronous optimization, i.e., AGN and ADAG. This work shows that these methods outperform existing methods and are more robust to (distributed) hyperparameterization than existing distributed optimization algorithms such as DOWNPOUR, (A)EASGD, and DynSGD. Additionally, this thesis presents several smaller contributions. First, we show that the convergence rate of EASGD-derived algorithms is impaired by an equilibrium condition; however, this equilibrium condition ensures that EASGD does not overfit quickly. Finally, we introduce a new metric, temporal efficiency, to evaluate distributed optimization algorithms against each other.
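    The staleness-aware damping used by schemes such as DynSGD can be sketched in one line: scale a delayed gradient by the inverse of its staleness before the parameter server applies it. The exact damping rule below is a common simplification, not necessarily the thesis's formulation.

```python
import numpy as np

def staleness_aware_update(theta, grad, lr, staleness):
    """Apply a delayed gradient to the central parameters, damping it
    by its staleness tau (the number of server updates since the
    gradient was computed), so stale contributions count for less."""
    return theta - (lr / (staleness + 1.0)) * grad

theta = np.array([1.0, 1.0])
theta = staleness_aware_update(theta, grad=np.array([2.0, 2.0]), lr=0.5, staleness=4)
# a gradient that is 4 steps stale contributes 1/5 of its fresh step size
```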

    JoeriHermans/dist-keras: Stale Gradient

    Applied some production-critical methods.

    Anonymizing spreadsheet data and metadata with anonymousXL

    In spreadsheet risk analysis, we often encounter spreadsheets that are confidential. This might hinder adoption of spreadsheet analysis tools, especially web-based ones, as users do not want to have their confidential spreadsheets analyzed. To address this problem, we have developed AnonymousXL, an Excel plugin that makes spreadsheets anonymous with two actions: 1) it removes all sensitive metadata, and 2) it obfuscates all spreadsheet data within the Excel worksheets such that the result resembles, but cannot be traced back to, the original values.